Sains Malaysiana 54(6)(2025): 1629-1639
http://doi.org/10.17576/jsm-2025-5406-17
SMOTE-PCADBSCAN: A Novel Approach for Addressing Class Imbalance in Water Quality Prediction
(SMOTE-PCADBSCAN: Suatu Pendekatan Novel untuk Menangani Ketidakseimbangan Kelas dalam Ramalan Kualiti Air)
NORASHIKIN NASARUDDIN1,2,*, NURULKAMAL MASSERAN1, WAN MOHD RAZI IDRIS3 & AHMAD ZIA UL-SAUFIE4
1Department of Mathematical Sciences, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
2Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM) Kedah Branch, 08400 Merbok, Kedah, Malaysia
3Department of Earth Science and Environment, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
4Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Selangor, Malaysia
Received: 12 August 2024/Accepted: 13 March 2025
Abstract
An accurate and trustworthy prediction model is essential for supporting policy decisions on water quality in environmental management. Nonetheless, imbalanced datasets are prevalent in this discipline and hinder the accurate identification of crucial ecological factors. This study proposed a novel SMOTE-PCADBSCAN model to improve the classification of water quality data by combining three key components: (i) the synthetic minority over-sampling technique (SMOTE), (ii) principal component analysis (PCA), and (iii) density-based spatial clustering of applications with noise (DBSCAN). The minority class was first augmented using SMOTE, after which PCA reduced the dimensionality of the data. Subsequently, DBSCAN was used to improve the quality of the synthetic data by detecting and eliminating extraneous data points. A Malaysia-based multi-class water quality dataset was employed to assess the effectiveness of this model. Five classifiers were evaluated on four versions of the dataset (Original, SMOTE, SMOTE-DBSCAN, and SMOTE-PCADBSCAN): (i) decision tree, (ii) random forest (RF), (iii) gradient boosting method, (iv) adaptive boosting, and (v) extreme gradient boosting. Although the original dataset yielded high accuracy, class imbalance impaired the detection of minority classes. Among the four versions, the SMOTE-DBSCAN and SMOTE-PCADBSCAN synthetic datasets achieved the best performance metrics. RF trained with the SMOTE-PCADBSCAN approach achieved the highest accuracy and the best F1 scores, indicating excellent water quality classification and handling of imbalanced data. Consequently, the classification accuracy of imbalanced environmental datasets could be enhanced by employing advanced oversampling techniques and ensemble approaches.
Keywords: DBSCAN; imbalanced data; PCA; SMOTE; water quality
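The three stages named in the abstract (SMOTE, then PCA, then DBSCAN-based filtering) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say which points DBSCAN removes, so the sketch assumes that DBSCAN runs on the PCA-projected oversampled data and that only synthetic points labelled as noise are discarded. The function names, the parameters (`n_new`, `n_components`, `eps`, `min_pts`), and the toy data are all hypothetical.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic points by interpolating between a sampled
    minority point and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(rng)
    d = np.linalg.norm(X_min[:, None] - X_min[None], axis=2)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per point
    base = rng.integers(0, len(X_min), n_new)
    neigh = nn[base, rng.integers(0, k, n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

def pca(X, n_components):
    """Project X onto its leading principal components via SVD."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def dbscan(X, eps, min_pts):
    """Plain DBSCAN; returns cluster labels, with -1 marking noise."""
    d = np.linalg.norm(X[:, None] - X[None], axis=2)
    neighbours = [np.flatnonzero(row <= eps) for row in d]
    core = np.array([len(n) >= min_pts for n in neighbours])
    labels = np.full(len(X), -1)
    cluster = 0
    for i in range(len(X)):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:                            # grow the cluster from core points
            p = stack.pop()
            if not core[p]:
                continue                        # border points are not expanded
            for q in neighbours[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    stack.append(q)
        cluster += 1
    return labels

def smote_pca_dbscan(X, y, minority, n_new, n_components=2,
                     eps=0.5, min_pts=5, seed=0):
    """Oversample one class, then drop synthetic points that DBSCAN flags
    as noise in PCA space (assumed filtering rule, see lead-in)."""
    X_syn = smote(X[y == minority], n_new, rng=seed)
    X_all = np.vstack([X, X_syn])
    y_all = np.concatenate([y, np.full(n_new, minority)])
    is_synthetic = np.arange(len(X_all)) >= len(X)
    labels = dbscan(pca(X_all, n_components), eps, min_pts)
    keep = ~(is_synthetic & (labels == -1))     # original samples are always kept
    return X_all[keep], y_all[keep]             # returned in the original feature space

# Toy data: 100 majority and 10 minority samples with 4 features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 4)), rng.normal(3, 1, (10, 4))])
y = np.array([0] * 100 + [1] * 10)
X_res, y_res = smote_pca_dbscan(X, y, minority=1, n_new=90, eps=1.0)
```

In practice, `eps` and `min_pts` strongly affect how many synthetic points survive, so they would need tuning for a real dataset. A multi-class dataset would call the function once per minority class.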
Abstract (translated from Malay)
An accurate and reliable prediction model is important for supporting policy decisions in environmental management related to water quality prediction. However, imbalanced datasets are common in this discipline and hinder the accurate identification of important ecological factors. This study proposed an innovative SMOTE-PCADBSCAN model to improve the classification of water quality data using three main components: (i) the synthetic minority over-sampling technique (SMOTE), (ii) principal component analysis (PCA), and (iii) density-based spatial clustering of applications with noise (DBSCAN). The minority class was first augmented using SMOTE and then underwent dimensionality reduction by PCA. Next, DBSCAN was used to produce high-quality synthetic data by detecting and removing irrelevant or redundant data points. A multi-class water quality dataset from Malaysia was used to determine the effectiveness of this model. Four different versions of the dataset (Original, SMOTE, SMOTE-DBSCAN, and SMOTE-PCADBSCAN) were analysed with five types of classifier: (i) decision tree, (ii) random forest, (iii) gradient boosting machine, (iv) adaptive boosting, and (v) extreme gradient boosting. Although the original dataset showed high accuracy, class imbalance occurred when detecting minority classes. Among the datasets, the metric performance of the synthetic datasets based on SMOTE-DBSCAN and SMOTE-PCADBSCAN was better. The highest accuracy and optimal F1 scores were also shown by RF using the SMOTE-PCADBSCAN approach, which performed excellently in water quality classification and imbalanced data handling. Therefore, the classification accuracy of imbalanced environmental datasets can be improved by using advanced oversampling techniques and ensemble approaches.
Keywords: Imbalanced data; DBSCAN; water quality; PCA; SMOTE
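The abstract's observation that high accuracy on the original data can coexist with poor minority-class detection can be illustrated with a small worked example. The sketch below is only an illustration: the two class labels and the counts are hypothetical, and macro-averaged F1 is an assumption, since the abstract does not state which F1 averaging was used.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so each class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical imbalanced test set: 95 "clean" and 5 "polluted" samples,
# scored against a classifier that always predicts the majority class.
y_true = ["clean"] * 95 + ["polluted"] * 5
y_pred = ["clean"] * 100

acc = accuracy(y_true, y_pred)   # 0.95, although no minority sample is detected
f1 = macro_f1(y_true, y_pred)    # about 0.49, because the minority-class F1 is 0
```

Accuracy rewards the majority-only classifier, while the macro-averaged F1 exposes the undetected minority class. This is why F1 is reported alongside accuracy when comparing resampling strategies.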
*Corresponding author; email: norashikin116@uitm.edu.my